Introduction to Math 154

Computational Statistics

Jo Hardin

2024-08-26

Agenda 8/26/24

  1. Syllabus
  2. Workflow - Stitch Fix
  3. Twitter
  4. Tools

Important

Before Wednesday, listen to the full conversation of Not So Standard Deviations - Compromised Shoe Situation.

Course structure

  • weekly HW (to GitHub + Gradescope)
  • bi-weekly quizzes
  • data project
  • in-class activities / clickers
  • ethical considerations

Additional details

  • Canvas has all the links
    1. course website – almost everything
    2. class notes
    3. Canvas page – solutions and assignments
  • no computers (tablets fine)
  • good communication
  • TidyTuesday

Syllabus

  • office hours
  • mentor sessions
  • anonymous feedback
  • dates for assignments
  • links to resources
  • HW grading
  • project information

Important

I need your GitHub user name - please email it to me.

Learning goals

By the end of the course, you will be able to…

  • gain insight from data
  • gain insight from data, reproducibly
  • gain insight from data, reproducibly, using modern programming tools and techniques
  • gain insight from data, reproducibly (with literate programming and version control), using modern programming tools and techniques

Data Science Process

Workflow

Stitch Fix

Example of how data and algorithms are used to make decisions.

http://algorithms-tour.stitchfix.com/

What can/can’t statistics & data science do?

  • Can model the data at hand!
  • Can find patterns & visualizations in large datasets.
  • Can’t establish causation (mostly).
  • Can’t represent data if it isn’t there.

Twitter

In 2013, DiGrazia et al. published a provocative paper suggesting that polling could now be replaced by analyzing social media data. They analyzed 406 competitive US congressional races using over 3.5 billion tweets. In an article in The Washington Post one of the co-authors, Rojas, writes: “Anyone with programming skills can write a program that will harvest tweets, sort them for content and analyze the results. This can be done with nothing more than a laptop computer.” (Rojas, 2013)

  1. The data come from neither an experiment nor a random sample, so careful thought must be given to whom the analysis can be generalized. Moreover, the data were scraped from the internet rather than collected under a sampling design.
  2. The analysis was done combining domain knowledge (about congressional races) with a data source that seems completely irrelevant at the outset (tweets).
  3. The dataset was quite large! 3.5 billion tweets were collected, and a random sample of 500,000 tweets was analyzed.
  4. The researchers were from sociology and computer science - a truly collaborative endeavor, and one that is often quite efficient at producing high quality analyses.

Activity

Spend a few minutes reading the Rojas editorial. Be sure to consider Figure 1 carefully, and address the following questions.

Statistics Hat

  1. Discuss Figure 1 with your neighbor. What is its purpose? What does it convey? Think critically about this data visualization. What would you do differently?

  2. How would you improve the plot? For example, how would you annotate it to make it more convincing / communicative? Or does it not need enhancement?

  3. Do you think the study holds water? Why or why not? What are the shortcomings of this study?

Data Scientist Hat

Imagine that your boss, who does not have advanced technical skills or knowledge, asked you to reproduce the study you just read. Discuss the following with your neighbor.

  1. What steps are necessary to reproduce this study? Be as specific as you can! Try to list the subtasks that you would have to perform.

  2. What computational tools would you use for each task?

  3. Identify all the steps necessary to conduct the study. Could you do it given your current abilities & knowledge? What about the practical considerations?

Advantages

  1. Cheap

  2. Can measure any political race (not just the wealthy ones).

Disadvantages

  1. Is it really reflective of the voting populace? Who would it bias toward?

  2. Does simple mention of a candidate always reflect voting patterns? When wouldn’t it?

  3. Margin of error of 2.7%. How is that number typically calculated in a poll? Note: for a poll of 1,000 people, the conservative formula gives \(2 \cdot \sqrt{(1/2)(1/2)/1000} \approx 0.0316\), i.e., about 3.2%.

  4. Tweets feel more free in terms of what you are able to say - is that a good thing or a bad thing with respect to polling?

  5. Can’t measure any demographic information.
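The conservative poll margin-of-error formula in the note above is easy to check in R; `moe()` is a helper name made up for this sketch, not a function from the study or from any package:

```r
# Conservative margin of error for a sample proportion:
# 2 * sqrt(p * (1 - p) / n), which is largest when p = 1/2.
moe <- function(n, p = 0.5) {
  2 * sqrt(p * (1 - p) / n)
}

moe(1000)  # about 0.0316, i.e., roughly a 3.2% margin of error
```

For n = 1,000 this is a bit larger than the 2.7% the study reports, which is worth discussing.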

What could be done differently?

  • Gelman: look only at close races

  • Gelman: “It might make sense to flip it around and predict twitter mentions given candidate popularity. That is, rotate the graph 90 degrees, and see how much variation there is in tweet shares for elections of different degrees of closeness.”

  • Gelman: “And scale the size of each dot to the total number of tweets for the two candidates in the election.”

  • Gelman: Make the data publicly available so that others can try to reproduce the results

https://statmodeling.stat.columbia.edu/2013/04/24/the-tweets-votes-curve/

Toolkit: Computing

We use tools to do the things. But the tools are not the things.

Reproducible data analysis

Reproducibility checklist

What does it mean for a data analysis to be “reproducible”?

Short-term goals:

  • Are the tables and figures reproducible from the code and data?
  • Does the code actually do what you think it does?
  • In addition to what was done, is it clear why it was done?

Long-term goals:

  • Can the code be used for other data?
  • Can you extend the code to do other things?

Toolkit for reproducibility

  • Scriptability \(\rightarrow\) R
  • Literate programming (code, narrative, output in one place) \(\rightarrow\) Quarto
  • Version control \(\rightarrow\) Git / GitHub

R and RStudio

R and RStudio

R logo

  • R is an open-source statistical programming language
  • R is also an environment for statistical computing and graphics
  • It’s easily extensible with packages

RStudio logo

  • RStudio is a convenient interface for R called an IDE (integrated development environment), e.g. “I write R code in the RStudio IDE”
  • RStudio is not a requirement for programming with R, but it’s very commonly used by R programmers and data scientists

R vs. RStudio

On the left: a car engine. On the right: a car dashboard. The engine is labelled R. The dashboard is labelled RStudio.

R packages

  • Packages: Fundamental units of reproducible R code, including reusable R functions, the documentation that describes how to use them, and sample data

  • As of August 26, 2024, there are 21,145 R packages available on CRAN (the Comprehensive R Archive Network)

  • We’re going to work with a small (but important) subset of these!

Tour: R + RStudio

Tour recap: R + RStudio

A short list (for now) of R essentials

  • Functions are (most often) verbs, followed by what they will be applied to in parentheses:
do_this(to_this)
do_that(to_this, to_that, with_those)
  • Packages are installed with the install.packages() function and loaded with the library() function, once per session:
install.packages("package_name")
library(package_name)

R essentials (continued)

  • Columns (variables) in data frames are accessed with $:
dataframe$var_name
  • Object documentation can be accessed with ?
?mean

tidyverse

Hex logos for dplyr, ggplot2, forcats, tibble, readr, stringr, tidyr, and purrr

tidyverse.org

  • The tidyverse is an opinionated collection of R packages designed for data science
  • All packages share an underlying philosophy and a common grammar
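For a small taste of that common grammar, here is a hypothetical snippet (not from the course materials) using the built-in mtcars data:

```r
library(tidyverse)

# dplyr verbs read like a pipeline: take mtcars, keep only the
# 4-cylinder cars, then compute their average fuel efficiency.
mtcars |>
  filter(cyl == 4) |>
  summarize(mean_mpg = mean(mpg))
# one row: mean_mpg is about 26.7
```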

Quarto

Quarto

  • Fully reproducible reports – each time you Render, the analysis is run from the beginning
  • Code goes in chunks
  • Narrative goes outside of chunks
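The two bullets above translate to a minimal Quarto skeleton; the title and contents here are made up for illustration:

````markdown
---
title: "My analysis"
format: html
---

Narrative goes here, outside of chunks.

```{r}
# code goes in chunks
x <- 2
x * 3
```
````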

Tour: Quarto

Tour recap: Quarto

RStudio IDE with a Quarto document, source code on the left and output on the right. Annotated to show the YAML, a link, a header, and a code chunk.

Environments

Important

The environment of your Quarto document is separate from the Console!

Remember this, and expect it to bite you a few times as you’re learning to work with Quarto!

Environments

First, run the following in the console:

x <- 2
x * 3


All looks good, eh?

Then, add the following in an R chunk in your Quarto document

x * 3


What happens? Why the error?

How will we use Quarto?

  • Every application exercise, lab, project, etc. is a Quarto document
  • You’ll always have a template Quarto document to start with
  • The amount of scaffolding in the template will decrease over the semester

Toolkit: Version control and collaboration

Git and GitHub

Git logo

  • Git is a version control system – like “Track Changes” features from Microsoft Word, on steroids
  • It’s not the only version control system, but it’s a very popular one

GitHub logo

  • GitHub is the home for your Git-based projects on the internet – like Dropbox but much, much better

  • We will use GitHub as a platform for web hosting and collaboration (and as our course management system!)

Versioning - done badly

Versioning - done better

Versioning - done even better

with human readable messages

How will we use Git and GitHub?


Git and GitHub tips

  • There are millions of git commands – ok, that’s an exaggeration, but there are a lot of them – and very few people know them all. 99% of the time you will use git to add, commit, push, and pull.
  • We will be doing Git things and interfacing with GitHub through RStudio, but if you google for help you might come across methods for doing these things in the command line – skip that and move on to the next resource unless you feel comfortable trying it out.
  • There is a great resource for working with git and R: happygitwithr.com. Some of the content in there is beyond the scope of this course, but it’s a good place to look for help.

Tour: Git + GitHub

Agenda 8/28/24

  1. Reproducibility
  2. GitHub
  3. NSSD

Important

Before next Tuesday, read: Tufte. 1997. Visual and Statistical Thinking: Displays of Evidence for Making Decisions. (Use Google to find it.)

Reproducibility

Example #1

Science retracts gay marriage paper without agreement of lead author LaCour

  • In May 2015 Science retracted a study of how canvassers can sway people’s opinions about gay marriage published just 5 months prior.

  • Science Editor-in-Chief Marcia McNutt:

    • Original survey data not made available for independent reproduction of results.
    • Survey incentives misrepresented.
    • Sponsorship statement false.
  • Two Berkeley grad students who attempted to replicate the study quickly discovered that the data must have been faked.

  • Methods we’ll discuss can’t prevent this, but they can make it easier to discover issues.

Source: http://news.sciencemag.org/policy/2015/05/science-retracts-gay-marriage-paper-without-lead-author-s-consent

Example #2

Seizure study retracted after authors realize data got “terribly mixed”

  • From the authors of Low Dose Lidocaine for Refractory Seizures in Preterm Neonates:

“The article has been retracted at the request of the authors. After carefully re-examining the data presented in the article, they identified that data of two different hospitals got terribly mixed. The published results cannot be reproduced in accordance with scientific and clinical correctness.”

Source: http://retractionwatch.com/2013/02/01/seizure-study-retracted-after-authors-realize-data-got-terribly-mixed/

Example #3

Bad spreadsheet merge kills depression paper, quick fix resurrects it

  • The authors informed the journal that the merge of lab results and other survey data used in the paper resulted in an error regarding the identification codes. Results of the analyses were based on the data set in which this error occurred. Further analyses established that the results reported in the manuscript and the interpretation of the data are not correct.

  • Original conclusion: Lower levels of CSF IL-6 were associated with current depression and with future depression […].

  • Revised conclusion: Higher levels of CSF IL-6 and IL-8 were associated with current depression […].

Source: http://retractionwatch.com/2014/07/01/bad-spreadsheet-merge-kills-depression-paper-quick-fix-resurrects-it/

Reproducible data analysis

  • Scriptability → R [in contrast to pull-down menus]

  • Literate programming → R Markdown [in contrast to multiple files]

  • Version control → Git / GitHub [in contrast to multiple versions]

Scripting and literate programming

Donald Knuth “Literate Programming (1983)”

“Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”

  • These ideas have been around for years!
  • and tools for putting them to practice have also been around
  • but they have never been as accessible as the current tools

Reproducibility checklist

  • Are the tables and figures reproducible from the code and data?
  • Does the code actually do what you think it does?
  • In addition to what was done, is it clear why it was done? (e.g., how were parameter settings chosen?)
  • Can the code be used for other data?
  • Can you extend the code to do other things?

Tools: R & RStudio

  • You must use both software programs
  • R does the programming
  • RStudio brings everything together
  • You may use Pomona’s server: https://rstudio.campus.pomona.edu/ (or https://rstudio.pomona.edu if you are off campus)
  • See course website for getting started: https://m154-comp-stats.netlify.app/github.html

R vs. RStudio

R: think “python”

RStudio: think “jupyter notebook” or “Google Colab”

Taken from ModernDive: An Introduction to Statistical and Data Sciences via R, by Ismay and Kim

RStudio

Jessica Ward, PhD student at Newcastle University

Tools: GitHub

  • You must submit your assignments via GitHub + Gradescope
  • Follow Jenny Bryan’s advice on how to get set-up: http://happygitwithr.com/
  • Follow course specific advice: https://m154-comp-stats.netlify.app/github.html

Steps for weekly homework

  1. You will get a link to the new assignment (clicking on the link will create a new private repo)
  2. Use RStudio
    • New Project, version control, Git
    • Clone the repo using SSH
  3. If it exists, rename the Rmd file to ma154-hw#-lname-fname.Rmd
  4. Do the assignment
    • commit and push after every problem
  5. All necessary files must be in the same folder (e.g., data)

Tools: a GitHub merge conflict (demo)

  • On GitHub (on the web) edit the README document and Commit it with a message describing what you did.

  • Then, in RStudio also edit the README document with a different change.

    • Commit your changes
    • Try to push – you’ll get an error!
    • Try pulling
    • Resolve the merge conflict and then commit and push
  • As you work in teams you will run into merge conflicts; learning how to resolve them properly will be very important.
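When the pull in the steps above brings in the conflicting edit, Git marks the disputed lines in the README with conflict markers. The branch label and edit text below are illustrative, but the marker syntax is standard Git:

```
<<<<<<< HEAD
Edit made in RStudio.
=======
Edit made on GitHub.
>>>>>>> origin/main
```

To resolve the conflict, delete the three marker lines, keep the text you want (possibly a mix of both edits), then commit and push.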


NSSD:

  1. What was Hilary trying to answer in her data collection?

  2. Name two of Hilary’s main hurdles in gathering accurate data.

  3. Which is better: high touch (manual) or low touch (automatic) data collection? Why?

  4. What additional covariates are needed / desired? Any problems with them?

  5. How much data does she need?